Improving protein structure prediction using templates and sequence embedding

Abstract Motivation Protein structure prediction has been greatly improved by deep learning, but the contribution of different information is yet to be fully understood. This article studies the impacts of two kinds of information for structure prediction: template and multiple sequence alignment (MSA) embedding. Templates have been used by some methods before, such as AlphaFold2, RoseTTAFold and RaptorX. AlphaFold2 and RosetTTAFold only used templates detected by HHsearch, which may not perform very well on some targets. In addition, sequence embedding generated by pre-trained protein language models has not been fully explored for structure prediction. In this article, we study the impact of templates (including the number of templates, the template quality and how the templates are generated) on protein structure prediction accuracy, especially when the templates are detected by methods other than HHsearch. We also study the impact of sequence embedding (generated by MSATransformer and ESM-1b) on structure prediction. Results We have implemented a deep learning method for protein structure prediction that may take templates and MSA embedding as extra inputs. We study the contribution of templates and MSA embedding to structure prediction accuracy. Our experimental results show that templates can improve structure prediction on 71 of 110 CASP13 (13th Critical Assessment of Structure Prediction) targets and 47 of 91 CASP14 targets, and templates are particularly useful for targets with similar templates. MSA embedding can improve structure prediction on 63 of 91 CASP14 (14th Critical Assessment of Structure Prediction) targets and 87 of 183 CAMEO targets and is particularly useful for proteins with shallow MSAs. When both templates and MSA embedding are used, our method can predict correct folds (TMscore > 0.5) for 16 of 23 CASP14 FM targets and 14 of 18 Continuous Automated Model Evaluation (CAMEO) targets, outperforming RoseTTAFold by 5% and 7%, respectively. Availability and implementation Available at https://github.com/xluo233/RaptorXFold. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
Predicting protein structure from its amino acid sequence has been greatly improved by deep learning. Since our proposal (Wang et al., 2017) of the deep ResNet method for protein contact, distance and structure prediction, many research groups have studied similar deep learning methods, such as DMPfold (Greener et al., 2019), tFold (Shen et al., 2021), AlphaFold1 (Senior et al., 2020) and trRosetta (Yang et al., 2020). At the recent 14th Critical Assessment of Structure Prediction (CASP14), DeepMind presented an attention-based method AlphaFold2 (Jumper et al., 2021) that achieved remarkable accuracy. Template-based modeling (TBM), including protein threading and homology modeling, dominated protein structure prediction for many years until the recent application of deep learning to template-free modeling. Our group first employed deep learning to integrate templates, identified by our inhouse protein threading method (Wu and Xu, 2021), and coevolution information to predict inter-residue distance, which is then used to build a tertiary structure. Blindly tested (as a part of the RaptorX server) in CASP13 and CASP14, our method achieved very good performance on the TBM targets (Wu and Xu, 2021). More recent studies (Jumper et al., 2021;Su et al., 2021;Wu and Xu, 2021;Zheng et al., 2019) confirmed that a combination of templates and deep learning may improve tertiary structure prediction for at least TBM targets even if HHsearch (Steinegger et al., 2019) is used to search for templates and build alignments, although HHsearch does not perform as well as our in-house threading software in identifying weakly similar templates. While supervised methods have produced breakthrough results for protein structure prediction, recent studies (Alley et al., 2019;Heinzinger et al., 2019;Rao et al., 2020) have shown that self-supervised protein language models can predict secondary structure and contacts at a reasonable accuracy. DeepMind (Jumper et al., 2021) also integrated multiple sequence alignment (MSA) embedding in AlphaFold2 but did not make use of a protein language model trained by millions of MSAs. The goal of this article is not to develop the best protein structure prediction method. Instead, we study in more detail how templates and MSA embedding generated by protein language models may improve protein structure prediction and its relationship with other factors, such as the number of templates, the difficulty level of a protein target, the depth of its MSA and its sequence length. Specifically, we trained four attention-based models with different feature combinations using the same training set; 2 Materials and methods 2.1 Overview of the method Figure 1 shows our overall network architecture. Given a target protein under prediction, we search genetic databases to build an MSA. We first generate an MSA representation using one-hot encoding, in which individual amino acids, unknown amino acids and gaps are treated as character-level tokens. Then we employ Facebook's MSATransformer  to generate the MSA embedding and ESM1b (Rao et al., 2020) to generate the sequence embedding. The details of the embedding module can be found in Supplementary Figure S1. We run NDThreader (Wu and Xu, 2021) for protein threading and select the top five template structures and their alignment to the target. Supplementary Figure S2 shows the template embedding module of our method. The template module embeds template information using three axial attentions (row-wise, column-wise and template-wise attention). This template representation is then concatenated with a pairwise representation using a pointwise attention module. The MSA Encoder module is similar to the RoseTTAFold 2D-track network. We add a pairwise decoder layer of 72 ResNet blocks to predict inter-residue relationships. We also use a recycling strategy similar to AlphaFold2. The detail of our method is presented in Supplementary Section S1. To facilitate the ablation study, we implemented our method so that it is easy to turn on and off a specific module, the template module, the MSA embedding module and the recycling strategy. Our model has 73.3M parameters in total.

Build models from predicted distance and orientation
Our deep model predicts distance distributions for Cb-Cb atom pairs and three types of inter-residue orientation presented in trRosetta (Yang et al., 2020). We discretize the distance into 37 bins: 0-2, 2-2.5, 2.5-3, . . ., 19.5-20, >20 Å and use one label to indicate an unknown distance when at least one of the two atoms does not have valid 3D coordinates in the PDB file. We discretize the orientation dihedrals x, h and angle / into 24, 24 and 12 bins uniformly, respectively.
To build 3D models, we convert predicted distance and orientation distributions into distance potential for tertiary structure prediction by energy minimization. Afterward, we use the fast relaxation protocol in PyRosetta (Chaudhury et al., 2010) for sidechain packing and reducing steric clashes.

Training and validation data
We use the following training and validation protein sets. CathS35 is a representative set of CATH domains (https://www.cathdb.info), in which any two domains share no more than 35% sequence identity. The data are available at ftp://orengoftp.biochem.ucl.ac.uk/ cath/releases/dailyrelease/archive/. The version we use for training is v4.2 dated 01/01/2020 and has 32 511 entries. We excluded very short domains and those with too many (>50%) missing Cb and Cb atoms. The protein domains in Cath S35 are shorter than the protein chains in BC40, so it may reduce GPU memory consumption and speed up training by using Cath S35.
PDB clusters all protein chains by MMseq2 (Steinegger and Sö ding, 2017) at 30%, 40%, . . ., 95% and 100% sequence identity each week to remove redundancy. BC40 contains 38 774 clusters, and it is a dataset with a 40% cutoff such that the proteins share no more than 40% sequence identity. In total, there are 90 363 protein chains as of January 1, 2020 used for training. While in training, we randomly pick one protein from each cluster.
We excluded very short domains (<25 AAs) and those with too many (>50%) missing Ca and Cb atoms and split them into two non-overlapping subsets: one for training and the other for validation. In the training stage, we choose random k templates out of the restricted set of n templates, while k ¼ Uniform[1, n].

Independent test data
We evaluate our model using three test sets: CASP13, CASP14 and CAMEO.  (Suzek et al., 2015) as of March 2018 and Metaclust50 (Steinegger and Sö ding, 2018) as of January 2018 to build its MSA.

Template search and alignment generation
We use our in-house protein threading software NDThreader to select the top five templates. During training, we use TMalign to find up to eight top templates for each target and then use NDThreader to generate their alignments to the target. While in training, we randomly select a subset of the eight templates.

Input features
We use the following input features: • seq_feat. It is a one-hot encoding with shape [L, 21] where '21' represents 20 amino acids and one 'unknown' amino acid. • residue_index. It is a feature of size [L] consisting of the residue index. • msa_feat. It is a one-hot encoding with shape [S, L, 22], where '22' indicates 20 amino acids þ unknown þ gap. • template_seq_feat. This is a feature of size [T, L, 11], constructed by the following features extracted from the query and template profile. (i) Amino acid substitution matrix, we use three amino acid substitution matrices BLOSUM80, BLOSUM62 and BLOSUM45 to handle different similarity levels. (ii) Sequence profile similarity, we calculate this by the inner product of position-specific frequency matrix (PSFM) and position-specific scoring matrix (PSSM) in two directions: query PSFM to template PSSM and template PSFM to query PSSM. Both PSFM and PSSM of the test proteins are derived from the profile HHM built by HHblits (Remmert et al., 2012) with E-value ¼ 0.001 and uni-clust30 dated October 2017. (iii) Residue mask, which a binary denotes the gap in the template for alignment, which is equal to 1 if and only if the target protein residue has an aligned residue on the template. • template_pair_feat. This is a feature of size [T, L, L, 129] and consists of several pairwise features extracted by a template according to its alignment to a target protein. Three types of Euclidean distance matrices are used for three types of atom pairs: Ca-Ca, Cb-Cb and N-O. We discretize the distance between 3.25 and 50.75 Å uniformly into 38 bins with bin width ¼ 1.25 and use one more bin to represent distance > 50.75 Å . We also use 'interval distance', which is calculated by the real distance value minus the lower bound of its bin. We also consider the gap binary matrix for distance. As for the orientation matrix, we employ the inter-residue orientation implemented in trRosetta and calculate the sin and cos values of the orientation angles.

Accuracy of predicted 3D structures when templates are not used
We have trained a model on the Cath S35 data without using templates and MSA embedding and then evaluated the model quality (measured by TMscore and Global Distance Test (GDT)) of the predicted 3D structures. Tables 1 and 2 and Supplementary Table S2 show that our baseline deep model performs similarly to the PyRosetta version of RosetTTAFold. On the CASP13 test set, the average quality (TMscore) of the first and the best models (selected by energy) obtained by our method is 0.790 and 0.805, respectively while the average quality by RoseTTAFold is 0.796 and 0.809, respectively. On the CASP14 test set, the average quality of the first and best models by our method is 0.746 and 0.760, respectively, slightly better than RoseTTAFold, which obtains 0.744 and 0.758. On the CAMEO test set, our method outperforms RoseTTAFold on Hard, Medium and Easy targets by 0.002, 0.019 and 0.003 in terms of TMscore, respectively. Our method performs slightly worse on large TBMHard and TBMEasy targets, possibly because our training proteins on average are shorter than those used by RoseTTAFold. In terms of the TMscore of the first models, our method outperforms RoseTTAFold on 50 of 110 CASP13 targets, 53 of 91 CASP14 targets and 96 of 183 CAMEO targets, respectively. In terms of the GDT, our method outperforms RoseTTAFold on 48 of 112 CASP13 targets, 52 of 91 CASP14 targets and 93 of 183 CAMEO targets, respectively. The RoseTTAFold's end-to-end version performs worse than its PyRosetta version. In our method, we recycle the output pairwise representation of the decoder and the first row MSA representation of the MSA Encoder module. These two types of representations are passed through LayerNorm to update the output of the embedding module. To evaluate the contribution of recycling, we trained another deep model on the same training set but disabling the recycling module. As shown in Supplementary Table S3, the recycling strategy can improve contact prediction significantly. The t-test shows that the difference is statistically significant (P ¼ 2EÀ8). This is because the recycling strategy makes the network deeper and brings better MSA and pairwise representation for the MSA Encoder without increasing the number of parameters (Jumper et al., 2021).

Accuracy of predicted 3D structures when templates are used
When templates are used, we have trained one model without MSA embedding and the other with MSA embedding on the Cath S35 data. As shown in Tables 1 and 2 and Supplementary Table S2, on the FM targets, our deep model with templates performs similarly to our deep method without templates. On the FM/TBM targets, our deep method with templates performs slightly worse than when Fig. 1. Overview of the network architecture employed in this article. Arrows show the information flow direction. The dark arrows indicate that gradient is used, while the light arrows indicate that gradient is not used templates are not used. In particular, our method with templates performs particularly badly on two targets T0953s2-D1 and T1047s2-D1 that is because our threading method did not find a similar template for those targets. On the TBMEasy and TBMHard targets, our method with templates performs significantly better (3-11% better in terms of TMscore) than when templates are not used. Most targets in the CAMEO test set have good templates (TMscore >0.6), our model with templates outperformed the model without templates by 4.8% and 9%, respectively, in terms of TMscore and GDT.
We also run RoseTTAFold with templates and study their impact. We use the same MSA as input and search the template set released in March 2020 for all methods to make the comparison fair. RoseTTAFold uses HHsearch to find templates. Considering that RoseTTAFold only provides the template set released on March 3, 2020, we only tested RoseTTAFold with templates on the CASP14 and CAMEO test sets. Tables 1 and 2 show that template information generated by HHsearch can improve the model quality of the CASP14 TBMHard and TBMEasy targets and all CAMEO targets but degrade the model quality of the CASP14-FM and CASP14-FM/TBM targets. Figure 2 shows the head-to-head comparison between templates used and not used. Overall template information improves 3D modeling accuracy for 71 of 110 CASP13 targets, 46 of 91 CASP14 targets and all 183 CAMEO targets, respectively. The t-test shows that the difference between the model with/without templates is statistically significant (P ¼ 3.8EÀ5 and 1.2EÀ4 when the first-ranked and the best models are considered, respectively). When templates are used, our method outperforms RoseTTAFold on 54 CASP14 targets (out of 91) and 129 CAMEO targets (out of 183) and underperforms on 36 targets and 52 targets. For CASP14 targets, when the first models are evaluated, our method with templates has an average TMscore and GDT of 0.776 and 0.711, respectively, outperforming RoseTTAFold (0.753 and 0.683, respectively). Figure 3 shows the relationship between MSA depth and template quality (i.e. target-template structure similarity). We evaluate the model with the GDT score on CASP13 and CASP14 test sets. Figure 2a shows that templates improve quality for those targets as long as the target-template similarity (TMscore) is >0.6. Figure 2b shows that for some FM and FM/TBM targets with shallow MSA (MSA depth < 100), templates may still improve the model quality. Our in-house threading software NDThreader can find some templates and generate accurate alignments even for some hard targets Note: TM represents TMscore and GDT is GDT_TS scaled to [0, 1]. RoseTTAFold means the pyrosetta version, while RoseTTAFold end-to-end means the end-to-end version. BCData means using BC40 as the training set. We use the same MSA as input for all methods. For the method using template information, we use PDB70 released in March 2020 as the template set. For the distance-based method, we use the same folding script to generate 60 decoys and select the best model by energy. The first number in the table means the model selected by energy, while the second number means the best model selected by TMscore. Sequence embedding is generated by ESM-1b and MSA embedding is generated by MSATransformer. Note: TM denotes TMscore and GDT is GDT_TS scaled to [0, 1]. Easy and Hard targets are defined according to the average lDDT of the corresponding structural model. RoseTTAFold means the pyrosetta version while RoseTTAFold end-to-end means the end-to-end version. BCData means using BC40 as the training set. We use the same MSA as input for all methods. For the method using template information, we use PDB70 released in March 2020 as the template set. For the distance-based method, we use the same folding script to generate 60 decoys and select the model by energy. The first number in the table means the first model selected by energy while the second number means the best model selected by TMscore. Sequence embedding is generated by ESM-1b and MSA embedding is generated by MSATransformer.

Impact of templates on structure prediction
(Supplementary Section S4). For proteins with shallow MSA, although template information is not accurate, template information can bring complementary information to shallow MSA. Figure 4 shows the impact of template information on attention maps for T1093-D2, which is a TBMHard target with 382 residues in CASP14. The template pointwise attention map before MSA Encoder is similar to the distrogram. That is because the template selected by our threading method is similar (TMscore ¼ 0.729 for the first-ranked template) to the target. Figure 3a and b shows the first pairwise attention map with/without template information on the axial transformer layer in the MSA Encoder. Accurate template information brings better initial attention to the MSA Encoder. As shown in Figure 5, template information improves by 0.394 in terms of TMscore for T1093-D2 with high-quality template information.
We also study the impact of the number of templates on structure prediction. As shown in Table 3, templates have little or slightly worse impact on the CASP14 FM and FM/TBM targets. On the TBMHard and TBMEasy targets, when more than five templates are used, the quality of hard targets is worse, while five templates seem to be the best choice for easy targets. On the TBMHard and TBMEasy targets with good templates, once the best template is used, the number of templates has little impact, which shows that our attention-based template module can make use of multiple template information.
To further study the impact of template quality on the quality of predicted 3D structures, we further use the best templates obtained by the structure alignment tool (TMalign) as input. We run TMsearch (Zhang and Skolnick, 2005) for CASP14 targets (assuming that we have their experimental structures) against the PDB database released in March 2020 and choose the 5 templates with the highest TMscore as the best templates. As shown in Table 4, the TMalign-selected templates significantly improve the prediction accuracy of the FM and TBMHard targets. Supplementary Section S4 shows that NDThreader cannot find a good template (TMscore > 0.5) for almost all CASP14 FM targets and some FM/TBM targets. For the TBMEasy targets, the templates selected by NDThreader are good enough and thus, TMalign-selected templates do not have an advantage over NDThreader.
3.4 Accuracy of predicted 3D structures when self-supervised MSA embedding is used Self-supervised methods can learn from billions of sequences and capture extra information not encoded in existing protein structures. When information from self-supervised methods is used, we trained a model with MSA embedding generated by MSATransformer and  the other with sequence embedding generated by ESM-1b on the CathS35 set. These models all use template information.
CASP13 FM targets and CASP14 FM targets are challenging since there is no similar template in PDB and most of them have shallow MSAs. When the first models of the CASP13 FM targets and CASP14 FM targets are evaluated, MSA embedding improves model quality by 0.032 and 0.043 TMscore, respectively. Figure 6 shows the head-to-head comparison between our method with and without MSA embedding. On the TBMEasy and TBMHard targets, our method with MSA embedding produces the first models with an average of 0.907 and 0.909 TMscore, respectively. That is, even on the easy targets for which our method with templates shows good performance and generates accurate models, MSA embedding still helps a little. Sequence embedding can improve the model quality of FM and FM/TBM targets, but it is not as significant as MSA embedding. As shown in Table 2, when the first model of the CAMEO targets is evaluated, the model with MSA embedding does not show any advantage in terms of TMscore and GDT, possibly because most CAMEO targets have a similar template (TMscore > 0.6) in PDB.
We also train a model with MSA embedding on BC40 data. BC40 has more long proteins than the CathS35 data. It improves the model quality slightly. Our method with MSA embedding has an average of 0.803 TMscore and 0.741 GDT, respectively, on the CASP14 test set, outperforming RoseTTAFold's pyrosetta and endto-end versions by 6% and 10%, in terms of TMscore. Supplementary Figure S4 shows the detail of the comparison between our method and RoseTTAFold on the CASP13, CASP14 and CAMEO set. As a control, single AlphaFold2 predictions have an average of 0.866 TMscore and 0.836 GDT, outperforming our method by a large margin. In terms of TMscore, our method outperforms AlphaFold2 on 11 CASP14 targets (out of 91) and underperforms on 80 targets. AlphaFold2 did much better on the FM and FM/TBM targets. AlphaFold2 combined a number of modules to achieve high-accuracy structure prediction, including end-to-end training from MSA to 3D structure, interactively updating sequence and pairwise representation, using triangular attention to update pairwise representations, the invariant point attention (IPA) module, the recycling strategy, the self-distillation data, etc. We also use AlphaFold2 to predict the structure for the sequences in Uniclust30 (Mirdita et al., 2017) and select those predictions with plddt larger than 80 to build an AlphaFold2-distillation dataset for training. This may improve structure prediction by a 0.04 TMscore on the CASP14 test set. Besides, our current method mainly focuses on distance prediction, we also tried the structure module with IPA to predict coordinates directly but did not observe any improvement, possibly because the scale of training data we used is not big enough and the structural module cannot converge completely under the limited training steps.
3.5 Impact of MSA embedding on structure prediction Figure 7 shows the relationship between MSA embedding, the MSA depth and sequence length. We use the CASP13, CASP14 and CAMEO test sets and evaluate the model with the GDT score. The large impact of MSA embedding on the targets with shallow MSA suggests that MSA embedding is making effective use of unlabeled data on distance prediction. MSA embedding significantly improves the accuracy of short proteins with 300 residues. However, for long sequences, MSA embedding has little impact for the following two reasons: (i) the template information is accurate enough for those long sequences; (ii) MSATransformer restricts its training sequence length up to 256 residues.
We also study the relationship between model quality and different self-supervised features (sequence embedding extracted from ESM1b and MSA embedding extracted from MSATransformer) or MSA depth. As shown in Figure 8, the model with sequence embedding is much worse than the model with MSA embedding when the MSA is shallow (MSA depth < 100). As the MSA depth increases,    Note: emb means using MSA embedding generated by MSATransformer. NDT means using template features generated by NDThreader while TM means using template features generated by TMalign (assuming that we have their experimental structures). the quality gap between those two models with different selfsupervised features becomes closer, which is because template information is accurate for targets with deep MSA. MSA embedding can bring more information for structure prediction than sequence embedding, which is also stated by . We make a runtime analysis for targets with lengths ranging from short to large. Supplementary Figure S3 shows that using the embedding extracted from MSATransformer can achieve considerable improvement with little time overhead.

Better templates can improve AlphaFold2 model quality
Instead of using the templates and alignments generated by HHsearch, we modified AlphaFold2 to use the templates and alignments generated by our in-house threading software NDThreader. Table 5 shows that when better templates generated by NDThreader are used, the model quality improves slightly. With the templates identified by NDThreader, AlphaFold2 may improve model quality by 0.007 and 0.008 on the FM targets and TBMHard targets in terms of TMscore, respectively. The t-test shows that the difference is not statistically significant (P ¼ 0.045). For the TBMEasy targets, the template information produced by HHsearch is already good and not further improved by using NDThreader. As shown in Figure 9, Good template information can notably improve the predictions of T1060s3-D1, T1038-D1 and T1099-D1 (i.e. TMscore difference > 0.1) without obviously making any protein worse.

Conclusion
In this article, we integrate better template information and MSA embedding extracted from pre-trained protein language models using an attention-based model and study the impact of template information and self-supervised MSA embedding on deep learningbased protein structure prediction. Our test results on the CASP13,  Note: GDT are scaled to [0, 1]. AF2 represents the original AlphaFold2 without using any templates (model_3_ptm). AF2-TPL represents the original AlphaFold2 using all zero-placeholder template features (model_1_ptm). AF2 þ HH represents the templates generated by HHsearch (model_1_ptm). AF2 þ NDT represents the templates generated by NDThreader. (mod-el_1_ptm). For the AlphaFold2 model, model_1_ptm is trained with templates while model_3_ptm is not. Fig. 9. The head-to-head comparison of a single AlphaFold2 model with different template features on the CASP14 set. (a) Y-axis: use template feature generated by NDThreader; X-axis: use template feature generated by HHsearch. They have the same model parameter. (model_1_ptm, trained with template) (b) Y-axis: use template feature generated by NDThreader (mode_1_ptm, trained with template); X-axis: do not use template feature. (model_3_ptm, finetune without template) CASP14 and CAMEO sets show that both templates and MSA embedding are helpful. We have shown that when a similar template (TMscore > 0.6) is available, our attention-based model can make use of template information and improve structure prediction. The model with templates can improve structure prediction for 52 of 67 CASP14 TBM targets. We found that template information is helpful for targets with shallow MSA (MSA depth < 100). Even for some AlphaFold2-predicted models, better template information can slightly improve model quality.
For the FM targets with shallow MSA, embedding generated by the self-supervised method brings complementary features and improves both distance and structure prediction. Our experimental result shows that MSA embedding can improve structure prediction for 23 of 32 CASP13 FM targets and 20 of 23 CASP14 FM targets.
Overall, the performance of our current model is not comparable with AlphaFold2, but our deep model outperforms RoseTTAFold by 5%, 8% and 5% on the CASP13, CASP14 and CAMEO test sets in terms of GDT score, respectively. Training a deeper model may further improve the performance of our method, we will try this later when computing resources are available.

Author contributions
J.X. conceived the research. F.W., X.J. and X.L. implemented the algorithms. F.W. carried out the benchmarking experiments. F.W. and X.J analyzed the results. Both F.W. and J.X. wrote the manuscript. All authors revised the manuscript.